#### Library Imports

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```
#### Template

```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.12 - Performing Joins (clean one)")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext
```
```python
pets = spark.createDataFrame(
    [
        (1, 1, 'Bear'),
        (2, 1, 'Chewie'),
        (3, 2, 'Roger'),
    ], ['id', 'breed_id', 'nickname']
)

pets.toPandas()
```
| | id | breed_id | nickname |
|---|---|---|---|
| 0 | 1 | 1 | Bear |
| 1 | 2 | 1 | Chewie |
| 2 | 3 | 2 | Roger |
```python
breeds = spark.createDataFrame(
    [
        (1, 'Pitbull', 10),
        (2, 'Corgie', 20),
    ], ['id', 'name', 'average_height']
)

breeds.toPandas()
```
| | id | name | average_height |
|---|---|---|---|
| 0 | 1 | Pitbull | 10 |
| 1 | 2 | Corgie | 20 |
#### Performing Joins

There are typically two types of joins in SQL:

* **Inner Join**: 2 tables are joined on the basis of the common columns mentioned in the `ON` clause, ie. `left.join(right, left[lkey] == right[rkey])`.
* **Natural Join**: 2 tables are joined on the basis of all common columns, ie. `left.join(right, 'key')`.

source: https://stackoverflow.com/a/8696402
Question: Which is better? Is it just a style choice?
#### Option 1: Inner Join (w/ Different Keys)

```python
join_condition = pets['breed_id'] == breeds['id']
df = pets.join(breeds, join_condition)

df.toPandas()
```
| | id | breed_id | nickname | id | name | average_height |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | 1 | Pitbull | 10 |
| 1 | 2 | 1 | Chewie | 1 | Pitbull | 10 |
| 2 | 3 | 2 | Roger | 2 | Corgie | 20 |
**What Happened:**

* We have 2 columns named `id`, but they refer to different things.
* We can't uniquely reference these 2 columns (easily; it is still possible).
* Pretty long `join expression`.

This is not ideal. Let's try renaming a key before the join.
#### Option 2: Inner Join (w/ Same Keys)

```python
breeds = breeds.withColumnRenamed('id', 'breed_id')

join_condition = pets['breed_id'] == breeds['breed_id']
df = pets.join(breeds, join_condition)

df.toPandas()
```
| | id | breed_id | nickname | breed_id | name | average_height |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | 1 | Pitbull | 10 |
| 1 | 2 | 1 | Chewie | 1 | Pitbull | 10 |
| 2 | 3 | 2 | Roger | 2 | Corgie | 20 |
**What Happened:**

* We have 2 columns named `breed_id` which mean the same thing!
* Duplicate columns appear in the result.
* Still a pretty long `join expression`.

This is again not ideal.
#### Option 3: Natural Join

```python
df = pets.join(breeds, 'breed_id')

df.toPandas()
```
| | breed_id | id | nickname | name | average_height |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Bear | Pitbull | 10 |
| 1 | 1 | 2 | Chewie | Pitbull | 10 |
| 2 | 2 | 3 | Roger | Corgie | 20 |
**What Happened:**

* No duplicated column!
* No extra column!
* A single string is required for the `join expression` (or a list of columns/keys, if joining on multiple keys).
#### Summary

Performing a natural join was the most elegant solution, in terms of both the join expression and the resulting df.
**NOTE:** These rules also apply to the other join types (ie. left and right joins).
Some might argue that you will need both join keys in the result for further transformations, such as filtering on only the left or right key, but I would recommend doing this before the join, as it is more performant.